AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

pip install --upgrade scikit-learn
# %load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
precision_recall_curve,
roc_curve,
make_scorer,
)
Loan_data = pd.read_csv("Loan_Modelling.csv")
data = Loan_data.copy()
round(data.describe().T)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.0 | 1444.0 | 1.0 | 1251.0 | 2500.0 | 3750.0 | 5000.0 |
| Age | 5000.0 | 45.0 | 11.0 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.0 | 11.0 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 74.0 | 46.0 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| ZIPCode | 5000.0 | 93169.0 | 1759.0 | 90005.0 | 91911.0 | 93437.0 | 94608.0 | 96651.0 |
| Family | 5000.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 2.0 | 2.0 | 0.0 | 1.0 | 2.0 | 2.0 | 10.0 |
| Education | 5000.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.0 | 102.0 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
data.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data.shape
(5000, 14)
There are a total of 5,000 records in the dataset.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
# checking for unique values in ID column
data["ID"].nunique()
5000
Since all 5,000 IDs are unique, the ID column carries no predictive information and can be dropped.
# Dropping the ID column
data.drop("ID", axis=1, inplace=True)
# Checking the count of negative Experience values
negExp = data.Experience < 0
negExp.value_counts()
False    4948
True       52
Name: Experience, dtype: int64
# Checking all the negative values present in the Experience column
data[data['Experience'] < 0]['Experience'].value_counts()
-1    33
-2    15
-3     4
Name: Experience, dtype: int64
There are 52 records with negative Experience values (-1, -2, and -3), which are most likely sign or data-entry errors. We correct them by replacing each with its absolute value.
# Correcting the experience values
data["Experience"].replace(-1, 1, inplace=True)
data["Experience"].replace(-2, 2, inplace=True)
data["Experience"].replace(-3, 3, inplace=True)
# Checking if any negative values still present in the Experience column
data[data['Experience'] < 0]['Experience'].value_counts()
Series([], Name: Experience, dtype: int64)
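The three `replace` calls above can be collapsed into one vectorized step; a minimal sketch on a toy series, assuming the negative entries are sign errors:

```python
import pandas as pd

# Toy series mimicking the Experience column, with sign-error negatives
exp = pd.Series([-1, -2, -3, 10, 25])

# Taking the absolute value corrects -1 -> 1, -2 -> 2, -3 -> 3 in one pass
exp = exp.abs()
print(exp.tolist())  # [1, 2, 3, 10, 25]
```

This scales to any number of distinct negative values without listing them explicitly.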
# Mapping the Education codes to labels - 1: Undergraduate; 2: Graduate; 3: Advanced/Professional
data["Education"].replace(1, "Undergraduate", inplace=True)
data["Education"].replace(2, "Graduate", inplace=True)
data["Education"].replace(3, "Professional", inplace=True)
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.134600 | 11.415189 | 0.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.0 | 93437.0 | 94608.0 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
# checking the number of uniques in the zip code
data["ZIPCode"].nunique()
467
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]
data["ZIPCode"] = data["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode: 7
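The truncation above can be sketched on a toy column; keeping only the first two digits collapses 467 ZIP codes into a handful of coarse regions (an assumption that the leading digits identify a broad geographic area):

```python
import pandas as pd

# Toy ZIP codes resembling the dataset's values
zips = pd.Series([91107, 90089, 94720, 94112, 96651])

# Cast to string, keep the first two digits, then treat as a category
region = zips.astype(str).str[:2].astype("category")
print(region.tolist())   # ['91', '90', '94', '94', '96']
print(region.nunique())  # 4
```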
Education, Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are categorical variables whose encodings are listed in the data dictionary above; along with the truncated ZIPCode, we convert them to the category dtype.
# Converting categorical variables to the category dtype
category_col = ['Education','Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard','ZIPCode']
data[category_col] = data[category_col].astype('category')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   int64
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(1), int64(5)
memory usage: 269.8 KB
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
histogram_boxplot(data, "Age")
histogram_boxplot(data,'Experience')
histogram_boxplot(data,'Income')
histogram_boxplot(data,'CCAvg',kde=True)
histogram_boxplot(data,'Mortgage')
labeled_barplot(data, "Family", perc=True)
labeled_barplot(data, "Education")
labeled_barplot(data, "Securities_Account")
labeled_barplot(data, "CD_Account")
labeled_barplot(data, "Online", perc=True)
labeled_barplot(data, "CreditCard", perc=True)
labeled_barplot(data, "ZIPCode")
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the plot
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # heatmap of pairwise correlations for numeric columns
plt.show()
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan      0    1   All
Education
All             4520  480  5000
Professional    1296  205  1501
Graduate        1221  182  1403
Undergraduate   2003   93  2096
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Personal_Loan", "Family")
Family            1     2     3     4   All
Personal_Loan
All            1472  1296  1010  1222  5000
0              1365  1190   877  1088  4520
1               107   106   133   134   480
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Personal_Loan", "Securities_Account")
Securities_Account     0    1   All
Personal_Loan
All                 4478  522  5000
0                   4058  462  4520
1                    420   60   480
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Personal_Loan", "CD_Account")
CD_Account        0    1   All
Personal_Loan
All            4698  302  5000
0              4358  162  4520
1               340  140   480
------------------------------------------------------------------------------------------------------------------------
sns.pairplot(data=data[['Age','Income','ZIPCode','CCAvg','Mortgage','Experience','Personal_Loan']],hue='Personal_Loan');
sns.catplot(x="Family", y="Income", hue="Personal_Loan", data=data, kind="swarm")
sns.catplot(x='Education', y='Income', hue='Personal_Loan', data = data, kind='swarm')
sns.catplot(x='Age', y='Experience', hue='Personal_Loan', data = data, kind='swarm')
sns.catplot(x="CreditCard", y='CCAvg', hue="Personal_Loan", data=data,kind='swarm')
Q1 = data.quantile(0.25, numeric_only=True)  # 25th percentile of the numeric columns
Q3 = data.quantile(0.75, numeric_only=True)  # 75th percentile of the numeric columns
IQR = Q3 - Q1  # Interquartile range (75th percentile - 25th percentile)
lower = Q1 - 1.5 * IQR  # Lower bound; values outside [lower, upper] are treated as outliers
upper = Q3 + 1.5 * IQR  # Upper bound
((data.select_dtypes(include=["float64", "int64"]) < lower)
|(data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
Age           0.00
Experience    0.00
Income        1.92
Family        0.00
CCAvg         6.48
Mortgage      5.82
dtype: float64
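The IQR computation above can be wrapped in a small helper for reuse; a sketch (the `iqr_outlier_pct` name is illustrative, not from the original):

```python
import pandas as pd

def iqr_outlier_pct(df: pd.DataFrame) -> pd.Series:
    """Percentage of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], per numeric column."""
    num = df.select_dtypes(include="number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    mask = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)
    return mask.sum() / len(num) * 100

demo = pd.DataFrame({"x": [1, 2, 3, 4, 100]})  # 100 is an obvious outlier
print(iqr_outlier_pct(demo))  # x    20.0
```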
# Separate independent and dependent variable
X = data.drop(["Personal_Loan", "Experience"], axis=1)
y = data["Personal_Loan"]
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
# Splitting the data into train and test sets (70:30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
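With only about 9.6% of customers having accepted the loan, an unstratified split can drift from this base rate; passing `stratify=y` to `train_test_split` preserves the class ratio in both partitions. A minimal sketch on synthetic labels with the same imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5000, 3))
y_demo = np.r_[np.ones(480), np.zeros(4520)]  # ~9.6% positives, like Personal_Loan

# Stratified 70:30 split keeps the positive rate identical in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # 0.096 0.096
```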
# Initialize the Decision Tree Classifier
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)  # Fitting the decision tree on the train data
DecisionTreeClassifier(random_state=1)
The model can make two types of wrong predictions:
1. Predicting that a customer will take the personal loan when in reality they will not (a false positive) - a loss of marketing resources.
2. Predicting that a customer will not take the personal loan when in reality they would have (a false negative) - a loss of business opportunity.

Which case is more important? Losing a potential customer through a false negative is costlier for the bank than spending marketing resources on an uninterested one. To reduce this loss, the bank would want to maximize Recall: the greater the Recall, the lower the number of false negatives. Hence, the focus should be on increasing Recall.
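Recall is TP / (TP + FN); a quick sketch confirming the relationship on a toy confusion matrix:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one false negative, one false positive

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # 0.75
```

Pushing false negatives down raises recall directly, which is why it is the metric to optimize here.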
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Checking model performance on training data
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(model, X_train, y_train)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
Checking model performance on test set
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.981333 | 0.899329 | 0.911565 | 0.905405 |
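Perfect scores on the training set with lower test scores indicate the unpruned tree has overfit. Since `GridSearchCV` is already imported, pruning could be tuned along these lines (a sketch on synthetic data, with illustrative parameter ranges, not the original tuning):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the loan dataset
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.9], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 5, 10]},
    scoring="recall",  # prioritize catching would-be loan takers
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Restricting depth and leaf size trades a little training accuracy for better generalization on unseen customers.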
confusion_matrix_sklearn(model, X_test, y_test)
Visualizing the Decision Tree
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_Professional', 'Education_Undergraduate']
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
|--- Income <= 116.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2553.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Family <= 3.50 | | | | |--- ZIPCode_93 <= 0.50 | | | | | |--- Age <= 28.50 | | | | | | |--- Education_Undergraduate <= 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Education_Undergraduate > 0.50 | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | |--- Age > 28.50 | | | | | | |--- CCAvg <= 2.20 | | | | | | | |--- weights: [48.00, 0.00] class: 0 | | | | | | |--- CCAvg > 2.20 | | | | | | | |--- Education_Professional <= 0.50 | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | |--- Education_Professional > 0.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- ZIPCode_93 > 0.50 | | | | | |--- Age <= 37.50 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Age > 37.50 | | | | | | |--- CCAvg <= 1.10 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- CCAvg > 1.10 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- Family > 3.50 | | | | |--- Age <= 32.50 | | | | | |--- ZIPCode_92 <= 0.50 | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | |--- ZIPCode_92 > 0.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 32.50 | | | | | |--- Age <= 60.00 | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | |--- Age > 60.00 | | | | | | |--- weights: [4.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- Income <= 92.50 | | | |--- CD_Account <= 0.50 | | | | |--- Age <= 26.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 26.50 | | | | | |--- CCAvg <= 3.55 | | | | | | |--- CCAvg <= 3.35 | | | | | | | |--- Age <= 37.50 | | | | | | | | |--- Age <= 33.50 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- Age > 33.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 37.50 | | | | | | | | |--- Income <= 82.50 | | | | | | | | | |--- weights: [23.00, 
0.00] class: 0 | | | | | | | | |--- Income > 82.50 | | | | | | | | | |--- Income <= 83.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Income > 83.50 | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.35 | | | | | | | |--- Family <= 3.00 | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | |--- Family > 3.00 | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.55 | | | | | | |--- Income <= 81.50 | | | | | | | |--- weights: [43.00, 0.00] class: 0 | | | | | | |--- Income > 81.50 | | | | | | | |--- Income <= 83.50 | | | | | | | | |--- Age <= 45.50 | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | | |--- Age > 45.50 | | | | | | | | | |--- Family <= 3.50 | | | | | | | | | | |--- CCAvg <= 4.05 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | | |--- CCAvg > 4.05 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- Family > 3.50 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- Income > 83.50 | | | | | | | | |--- weights: [24.00, 0.00] class: 0 | | | |--- CD_Account > 0.50 | | | | |--- weights: [0.00, 5.00] class: 1 | | |--- Income > 92.50 | | | |--- Education_Undergraduate <= 0.50 | | | | |--- Age <= 63.50 | | | | | |--- Mortgage <= 172.00 | | | | | | |--- CD_Account <= 0.50 | | | | | | | |--- Age <= 60.50 | | | | | | | | |--- weights: [0.00, 21.00] class: 1 | | | | | | | |--- Age > 60.50 | | | | | | | | |--- Education_Professional <= 0.50 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- Education_Professional > 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- CD_Account > 0.50 | | | | | | | |--- CCAvg <= 3.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- CCAvg > 3.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Mortgage > 172.00 | | | | | | |--- Age <= 36.50 | | | | | | | 
|--- weights: [3.00, 0.00] class: 0 | | | | | | |--- Age > 36.50 | | | | | | | |--- Mortgage <= 284.50 | | | | | | | | |--- CCAvg <= 4.95 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- CCAvg > 4.95 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Mortgage > 284.50 | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- Age > 63.50 | | | | | |--- weights: [2.00, 0.00] class: 0 | | | |--- Education_Undergraduate > 0.50 | | | | |--- CD_Account <= 0.50 | | | | | |--- Family <= 3.50 | | | | | | |--- Online <= 0.50 | | | | | | | |--- Family <= 2.50 | | | | | | | | |--- Age <= 55.00 | | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | | |--- Age > 55.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Family > 2.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | |--- Family > 3.50 | | | | | | |--- CCAvg <= 4.20 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- CCAvg > 4.20 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- CD_Account > 0.50 | | | | | |--- Income <= 93.50 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- Income > 93.50 | | | | | | |--- weights: [0.00, 5.00] class: 1 |--- Income > 116.50 | |--- Education_Undergraduate <= 0.50 | | |--- weights: [0.00, 222.00] class: 1 | |--- Education_Undergraduate > 0.50 | | |--- Family <= 2.50 | | | |--- weights: [375.00, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- weights: [0.00, 47.00] class: 1
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- Education_Undergraduate <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Education_Undergraduate > 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [48.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 2.20
|   |   |   |   |   |   |   |--- Education_Professional <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_Professional > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_93 > 0.50
|   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age > 37.50
|   |   |   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- ZIPCode_92 <= 0.50
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_92 > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 33.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age > 37.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income > 82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income > 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family > 3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income > 81.50
|   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |--- Age <= 45.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 45.50
|   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 4.05
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CCAvg > 4.05
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income > 83.50
|   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education_Undergraduate <= 0.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 21.00] class: 1
|   |   |   |   |   |   |   |--- Age > 60.50
|   |   |   |   |   |   |   |   |--- Education_Professional <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Education_Professional > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 3.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage > 172.00
|   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |   |--- Mortgage <= 284.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 4.95
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg > 4.95
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage > 284.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Education_Undergraduate > 0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- Age <= 55.00
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 55.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income > 93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|--- Income > 116.50
|   |--- Education_Undergraduate <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_Undergraduate > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance
print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                              Imp
Education_Undergraduate  0.403732
Income                   0.304761
Family                   0.161717
CCAvg                    0.053107
Age                      0.036035
CD_Account               0.025711
Mortgage                 0.005557
Education_Professional   0.005144
ZIPCode_92               0.003080
ZIPCode_93               0.000594
Online                   0.000561
Securities_Account       0.000000
ZIPCode_91               0.000000
ZIPCode_94               0.000000
ZIPCode_95               0.000000
ZIPCode_96               0.000000
CreditCard               0.000000
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
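Because Gini importances are normalized, they always sum to 1 once the tree has made at least one split — a useful sanity check when comparing the rankings above. A standalone sketch on synthetic data (not the bank dataset; all names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)
tree_demo = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)

# feature_importances_ is normalized, so the values sum to 1
print(tree_demo.feature_importances_.sum())
```

This also explains why unused features (e.g. Securities_Account above) show exactly 0: they never appear in any split, so they contribute no impurity reduction.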
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(6, 15),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}
# Use recall as the scoring metric to compare parameter combinations,
# since missing a potential loan buyer is the costlier error here
recall_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Use the best combination of parameters found by the grid search
estimator = grid_obj.best_estimator_
# Fit the best estimator on the train data
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, random_state=1)
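Besides `best_estimator_`, the fitted `GridSearchCV` object also exposes the winning parameter combination and its mean cross-validated score. A self-contained sketch on synthetic imbalanced data (smaller grid than above; all values illustrative, not from the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the ~9% loan-conversion rate
X_demo, y_demo = make_classification(
    n_samples=400, weights=[0.9, 0.1], random_state=1
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5]},
    scoring=make_scorer(recall_score),  # recall of the positive (loan) class
    cv=5,
).fit(X_demo, y_demo)

print(grid.best_params_)   # the winning combination
print(grid.best_score_)    # its mean cross-validated recall
```

Inspecting `best_params_` alongside `best_score_` is a quick way to see whether the search hit the edge of the grid (e.g. always picking the largest `max_depth`), which would suggest widening the parameter ranges.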
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.990286 | 0.927492 | 0.968454 | 0.947531 |
confusion_matrix_sklearn(estimator, X_train, y_train)
Checking model performance on test set
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.98 | 0.865772 | 0.928058 | 0.895833 |
Visualizing the Decision Tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education_Undergraduate <= 0.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- weights: [9.00, 28.00] class: 1
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Education_Undergraduate > 0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|--- Income > 116.50
|   |--- Education_Undergraduate <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_Undergraduate > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance
print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                              Imp
Education_Undergraduate  0.446191
Income                   0.327387
Family                   0.155083
CCAvg                    0.042061
CD_Account               0.025243
Age                      0.004035
Securities_Account       0.000000
Online                   0.000000
Mortgage                 0.000000
ZIPCode_91               0.000000
ZIPCode_92               0.000000
ZIPCode_93               0.000000
ZIPCode_94               0.000000
ZIPCode_95               0.000000
ZIPCode_96               0.000000
Education_Professional   0.000000
CreditCard               0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000214 | 0.000429 |
| 2 | 0.000223 | 0.001542 |
| 3 | 0.000242 | 0.002750 |
| 4 | 0.000268 | 0.003824 |
| 5 | 0.000359 | 0.004900 |
| 6 | 0.000381 | 0.005280 |
| 7 | 0.000381 | 0.005661 |
| 8 | 0.000381 | 0.006042 |
| 9 | 0.000381 | 0.006423 |
| 10 | 0.000435 | 0.006859 |
| 11 | 0.000476 | 0.007335 |
| 12 | 0.000527 | 0.007862 |
| 13 | 0.000578 | 0.010176 |
| 14 | 0.000582 | 0.010758 |
| 15 | 0.000621 | 0.011379 |
| 16 | 0.000769 | 0.014456 |
| 17 | 0.000882 | 0.017985 |
| 18 | 0.001552 | 0.019536 |
| 19 | 0.002333 | 0.021869 |
| 20 | 0.003024 | 0.024893 |
| 21 | 0.003294 | 0.028187 |
| 22 | 0.006473 | 0.034659 |
| 23 | 0.023866 | 0.058525 |
| 24 | 0.056365 | 0.171255 |
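Two properties of the table above hold for any `cost_complexity_pruning_path` result: the effective alphas come back in increasing order, and total leaf impurity can only grow as pruning becomes more aggressive. A standalone check on synthetic data (not the bank dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=500, random_state=1)

path_demo = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_demo, y_demo
)

# alphas are sorted ascending; impurities are non-decreasing along the path
print(bool(np.all(np.diff(path_demo.ccp_alphas) >= 0)))
print(bool(np.all(np.diff(path_demo.impurities) >= 0)))
```

This is why the alpha-vs-impurity plot below is a monotone step curve: each successive alpha prunes away the weakest link, trading a little training purity for a simpler tree.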
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# select the pruned tree that gives the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006209286209286216, random_state=1)
Per the recall-vs-alpha graph, the best value of alpha appears to lie between 0.002 and 0.003, where the train and test recall curves are closest and the model is least overfit. We will use ccp_alpha = 0.003 for our model.
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=0.003, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.003, class_weight={0: 0.15, 1: 0.85},
random_state=1)
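The `class_weight={0: 0.15, 1: 0.85}` argument makes each positive (loan) example count roughly 5.7x as much as a negative one when impurities and pruning costs are computed, which pushes the tree toward higher recall on class 1 at some cost in precision. A standalone illustration on synthetic imbalanced data (weights and depth chosen for the sketch, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: ~10% positives, like the loan conversion rate
X_demo, y_demo = make_classification(
    n_samples=1000, weights=[0.9, 0.1], flip_y=0.05, random_state=1
)

plain = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)
weighted = DecisionTreeClassifier(
    max_depth=3, class_weight={0: 0.15, 1: 0.85}, random_state=1
).fit(X_demo, y_demo)

# Compare recall on the positive class with and without class weighting
print(recall_score(y_demo, plain.predict(X_demo)))
print(recall_score(y_demo, weighted.predict(X_demo)))
```

Combining class weights with `ccp_alpha` as above means pruning decisions are also made on the reweighted impurities, which is why the leaf weights in the post-pruned tree below are fractional.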
decision_tree_postpruned_perf_train = model_performance_classification_sklearn(
    estimator_2, X_train, y_train
)
decision_tree_postpruned_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.968286 | 0.975831 | 0.758216 | 0.853369 |
confusion_matrix_sklearn(estimator_2, X_train, y_train)
decision_tree_postpruned_perf_test = model_performance_classification_sklearn(
    estimator_2, X_test, y_test
)
decision_tree_postpruned_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.955333 | 0.926174 | 0.71134 | 0.804665 |
confusion_matrix_sklearn(estimator_2, X_test, y_test)
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- weights: [11.70, 11.90] class: 1
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [0.15, 6.80] class: 1
|--- Income > 98.50
|   |--- Education_Undergraduate <= 0.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.80
|   |   |   |   |--- weights: [11.85, 5.95] class: 0
|   |   |   |--- CCAvg > 2.80
|   |   |   |   |--- weights: [1.50, 19.55] class: 1
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 188.70] class: 1
|   |--- Education_Undergraduate > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 100.00
|   |   |   |   |--- weights: [0.45, 1.70] class: 1
|   |   |   |--- Income > 100.00
|   |   |   |   |--- weights: [67.20, 0.85] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [1.65, 45.90] class: 1
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance
print(
    pd.DataFrame(
        estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                              Imp
Income                   0.621892
Family                   0.150502
Education_Undergraduate  0.134024
CCAvg                    0.081623
CD_Account               0.011960
ZIPCode_93               0.000000
Education_Professional   0.000000
ZIPCode_96               0.000000
ZIPCode_95               0.000000
ZIPCode_94               0.000000
Age                      0.000000
ZIPCode_92               0.000000
ZIPCode_91               0.000000
Online                   0.000000
Securities_Account       0.000000
Mortgage                 0.000000
CreditCard               0.000000
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_postpruned_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 1.0 | 0.990286 | 0.968286 |
| Recall | 1.0 | 0.927492 | 0.975831 |
| Precision | 1.0 | 0.968454 | 0.758216 |
| F1 | 1.0 | 0.947531 | 0.853369 |
# test set performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_postpruned_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Accuracy | 0.981333 | 0.980000 | 0.955333 |
| Recall | 0.899329 | 0.865772 | 0.926174 |
| Precision | 0.911565 | 0.928058 | 0.711340 |
| F1 | 0.905405 | 0.895833 | 0.804665 |